Introduction

The term human wellbeing refers to people’s ability to live a life they value. It can comprise cultural heritage, health, and access to land and natural resources, as well as more material factors such as income-generating opportunities. What constitutes human wellbeing differs for each group and reflects its history, local culture and norms, political and socio-economic conditions, geography and ecological circumstances.

Discussions or research about wellbeing can therefore reveal different perspectives, experiences, values, concerns and aspirations, which in turn can improve understanding of people’s changing relationships with nature and prompt innovations in policies or processes that benefit both nature and people.

This project studies different ways of measuring wellbeing from an economics and development perspective.

PART 1: GDP and its Components as a Measure of Material Wellbeing

The GDP data used for this project has been taken from the United Nations National Accounts Main Aggregates Database, which contains estimates of total GDP and its components for all countries over the period 1970 to 2019.

An attempt has been made to look at how GDP and its components have changed over time, and investigate the usefulness of GDP per capita as a measure of wellbeing.

The GDP data can be read both as time series data and as cross-sectional data.



Time Series GDP Data

Exploring the data

In the raw data, each row is a Country and IndicatorName combination and each year is a separate column (wide format). For the analysis, the years need to be stacked into a single column (long format; this dataset is called long_gdp). The reshape2 package is used for this. The Country ID variable is not needed, so it is dropped first.

wide_gdp <- gdp

wide_gdp = wide_gdp[, -1]

Change data from wide to long format

library(reshape2)

# Every column other than the id variables (i.e. the year columns)
# is stacked into variable/value pairs
long_gdp = melt(wide_gdp, id.vars = c("Country", "IndicatorName"))

head(long_gdp)
##       Country
## 1 Afghanistan
## 2 Afghanistan
## 3 Afghanistan
## 4 Afghanistan
## 5 Afghanistan
## 6 Afghanistan
##                                                                              IndicatorName
## 1                                                            Final consumption expenditure
## 2 Household consumption expenditure (including Non-profit institutions serving households)
## 3                                         General government final consumption expenditure
## 4                                                                  Gross capital formation
## 5       Gross fixed capital formation (including Acquisitions less disposals of valuables)
## 6                                                            Exports of goods and services
##   variable    value
## 1     1970 74842341
## 2     1970 69796803
## 3     1970  5045538
## 4     1970  4257383
## 5     1970  4257383
## 6     1970  7452582
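
To make the reshaping step concrete, the sketch below performs the same wide-to-long transformation on a tiny made-up table using only base R. The data frame wide_toy, its values and the helper to_long are assumptions introduced for illustration; the report itself relies on melt from reshape2.

```r
# Toy wide table: one row per Country/IndicatorName, one column per year.
# All values are made up for illustration.
wide_toy <- data.frame(
  Country       = c("A", "A", "B", "B"),
  IndicatorName = c("Exports", "Imports", "Exports", "Imports"),
  `1970`        = c(10, 12, 20, 18),
  `1971`        = c(11, 13, 22, 19),
  check.names   = FALSE
)

# Hypothetical helper: stack every non-id column into (variable, value)
# pairs, which is the shape that melt() returns.
to_long <- function(df, id_vars) {
  measure_vars <- setdiff(names(df), id_vars)
  pieces <- lapply(measure_vars, function(v) {
    data.frame(df[id_vars], variable = v, value = df[[v]])
  })
  do.call(rbind, pieces)
}

long_toy <- to_long(wide_toy, c("Country", "IndicatorName"))
print(long_toy)
```

Each of the four rows in wide_toy contributes one row per year, so the long table has eight rows and the year labels end up in the variable column.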

The reshaped dataset is called long_gdp. Reshaping created a new column named variable, which holds the years (visible in the output above). The names function is used to rename it to Year.

names(long_gdp)[names(long_gdp) == "variable"] <- "Year"

Final consumption expenditure of each country is an important indicator taken into consideration when calculating GDP. Here, it is extracted using the subset function.

cons = subset(long_gdp,
              IndicatorName == "Final consumption expenditure")

Next, create a table showing the number of years with available (non-missing) data for each country. Countries with fewer available years have missing data.

missing_by_country = cons %>%
  group_by(Country) %>%
  summarize(available_years=sum(!is.na(value))) 

datatable(missing_by_country)

How many of the 220 countries in the dataset have complete information? A dataset can be considered complete if it has the maximum number of available observations.

sum(missing_by_country$available_years == max(
  missing_by_country$available_years))
## [1] 179

179 out of 220 countries have data for the entire period. Data is missing for at least one year in the remaining 41 countries.
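
The counting logic can be illustrated on a small made-up example. The sketch below uses base R only; cons_toy and its values are assumptions, and the report's own code uses group_by and summarize for the same purpose.

```r
# Toy long-format data: country "A" is complete, "B" has one missing year.
cons_toy <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  Year    = rep(1970:1972, times = 2),
  value   = c(1, 2, 3, 4, NA, 6)
)

# Number of non-missing observations per country
available_years <- tapply(!is.na(cons_toy$value), cons_toy$Country, sum)

# A country counts as complete if it reaches the maximum available count
n_complete   <- sum(available_years == max(available_years))
n_incomplete <- length(available_years) - n_complete

print(available_years)
```

This definition of completeness assumes that at least one country has data for every year; if none did, the maximum count would understate the full length of the period.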

Countries with missing data may differ systematically from other countries. For example, poorer countries often lack the resources to collect data, so some years are likely to be unavailable for them.



Calculation of GDP

There are three approaches to calculating GDP: the Value Added Approach, the Income Approach and the Expenditure Approach. Here, the Expenditure Approach is used, according to which:

GDP = Household consumption expenditure + General government final consumption expenditure + Gross capital formation + (Exports of goods and services − imports of goods and services)
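
As a quick numeric illustration of the identity, the sketch below plugs in the Afghanistan 1970 figures shown in the head(long_gdp) output. The helper gdp_expenditure is hypothetical, and because imports do not appear in that output, the imports figure is an assumed value used only to make the arithmetic run.

```r
# Hypothetical helper implementing the expenditure identity
gdp_expenditure <- function(hh, gov, capital, exports, imports) {
  hh + gov + capital + (exports - imports)
}

# Afghanistan, 1970, as printed in the head(long_gdp) output
hh      <- 69796803  # household consumption expenditure
gov     <- 5045538   # general government final consumption expenditure
capital <- 4257383   # gross capital formation
exports <- 7452582   # exports of goods and services

# Final consumption expenditure is household plus government consumption;
# the sum reproduces the first row of the output (74842341).
final_consumption <- hh + gov

# Assumption: imports are not shown in the output, so this is illustrative
imports_assumed <- 8000000

gdp <- gdp_expenditure(hh, gov, capital, exports, imports_assumed)
print(c(final_consumption = final_consumption, gdp = gdp))
```

The check on final consumption also explains why the later calculations can use Final.Expenditure in place of the separate household and government terms.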

Before moving to the analysis, shorten the names of the variables needed to calculate GDP

long_gdp$IndicatorName[long_gdp$IndicatorName == "Household consumption expenditure (including Non-profit institutions serving households)"] <- 
  "HH.Expenditure"

long_gdp$IndicatorName[long_gdp$IndicatorName == 
                        "General government final consumption expenditure"] <- 
  "Gov.Expenditure"

long_gdp$IndicatorName[long_gdp$IndicatorName == 
                        "Final consumption expenditure"] <-
  "Final.Expenditure"

long_gdp$IndicatorName[long_gdp$IndicatorName == 
                        "Gross capital formation"] <-
  "Capital"

long_gdp$IndicatorName[long_gdp$IndicatorName == 
                        "Imports of goods and services"] <-
  "Imports"

long_gdp$IndicatorName[long_gdp$IndicatorName == 
                        "Exports of goods and services"] <-
  "Exports"

Rather than looking at exports and imports separately, we usually look at the difference between them (exports minus imports), also known as Net Exports.

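Add a new column for net exports, i.e. (Exports - Imports), which is useful in later analysis. This requires a wide table with one column per indicator, called table_gdp here.

# Assumption: table_gdp is long_gdp cast to one column per indicator (reshape2)
table_gdp <- dcast(long_gdp, Country + Year ~ IndicatorName, value.var = "value")
table_gdp$Net.Exports <- 
  table_gdp[, "Exports"]-table_gdp[, "Imports"]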

Select three countries to check that net exports are calculated correctly: India, China and the US. (These countries were not chosen for any specific reason; the exercise is only a check on the calculated net exports.)

sel_countries = c("India", "United States", "China")

Using table_gdp, get exports, imports and year for these countries. The following table shows all observations for the selected countries and variables, including the calculated Net.Exports.

sel_gdp1 = subset(table_gdp, 
                 subset = (Country %in% sel_countries), 
                 select = c("Country", "Year", "Exports",
                            "Imports", "Net.Exports"))

datatable(sel_gdp1)


The next aim is to plot the GDP components in order to look for general patterns over time and make comparisons between countries.

Here, long-format data derived from long_gdp (the comp data frame) is used, as the long format is well suited to producing charts with the ggplot2 package. India, China, the United States and Burundi are considered for analysis. (A similar analysis can be done for other countries.)

India and China are developing countries, the United States is a developed country and Burundi is classed among the least developed countries. These countries were chosen to analyze trends across the country categories defined by the United Nations.
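
The charts that follow are drawn from a data frame called comp, whose construction is not shown in this report. A minimal sketch of one way it could be built is given below, assuming comp is the subset of long_gdp for the four countries and the five renamed components, rescaled to suit the axis label. The toy data, the 1e9 scaling factor and the names ending in _toy are all assumptions.

```r
# Toy stand-in for long_gdp; values are made up.
long_gdp_toy <- data.frame(
  Country       = c("India", "India", "Burundi", "France"),
  Year          = factor(c(1970, 1970, 1970, 1970)),
  IndicatorName = c("Exports", "Final.Expenditure", "Capital", "Exports"),
  value         = c(2e9, 5e10, 1e9, 3e10)
)

sel_toy    <- c("India", "China", "United States", "Burundi")
components <- c("HH.Expenditure", "Gov.Expenditure", "Capital",
                "Imports", "Exports")

# Keep only the chosen countries and the components that are plotted
comp_toy <- subset(long_gdp_toy,
                   Country %in% sel_toy & IndicatorName %in% components,
                   select = c("Country", "Year", "IndicatorName", "value"))

# Assumed rescaling so that the axis can be labelled in billions
comp_toy$value <- comp_toy$value / 1e9

print(comp_toy)
```

Only the India Exports row and the Burundi Capital row survive the filter here, since France is not among the chosen countries and Final.Expenditure is not among the plotted components.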

India <- ggplot(subset(comp, Country == "India"),
       aes(x = Year, y = value)) + 
  geom_line(aes(group = IndicatorName, 
                color = IndicatorName), size = 2) + 
  scale_x_discrete(breaks=seq(1970, 2019, by = 5)) + 
  scale_y_continuous(name="Billion US$") + 
  ggtitle("GDP components over time for India") + 
  theme_solarized() + 
  scale_color_brewer(palette = "Set1") + 
  theme(text = element_text(size = 30, color = "black"))

China <- ggplot(subset(comp, Country == "China"),
       aes(x = Year, y = value)) + 
  geom_line(aes(group = IndicatorName, 
                color = IndicatorName), size = 2) + 
  scale_x_discrete(breaks=seq(1970, 2019, by = 5)) + 
  scale_y_continuous(name="Billion US$") + 
  ggtitle("GDP components over time for China") + 
  theme_solarized() + 
  scale_color_brewer(palette = "Set1") + 
  theme(text = element_text(size = 30, color = "black"))

USA <- ggplot(subset(comp, Country == "United States"),
       aes(x = Year, y = value)) + 
  geom_line(aes(group = IndicatorName, 
                color = IndicatorName), size = 2) + 
  scale_x_discrete(breaks=seq(1970, 2019, by = 5)) + 
  scale_y_continuous(name="Billion US$") + 
  ggtitle("GDP components over time for United States") + 
  theme_solarized() + 
  scale_color_brewer(palette = "Set1") + 
  theme(text = element_text(size = 30, color = "black"))

Burundi <- ggplot(subset(comp, Country == "Burundi"),
       aes(x = Year, y = value)) + 
  geom_line(aes(group = IndicatorName, 
                color = IndicatorName), size = 2) + 
  scale_x_discrete(breaks=seq(1970, 2019, by = 5)) + 
  scale_y_continuous(name="Billion US$") + 
  ggtitle("GDP components over time for Burundi") + 
  theme_solarized() + 
  scale_color_brewer(palette = "Set1") + 
  theme(text = element_text(size = 30, color = "black"))


grid.arrange(India, China, USA, Burundi, ncol=2, nrow =2)


Key Takeaways

  • China and India are both developing countries and are still growing. Before the 2008 financial crisis, steady GDP growth can be observed in both countries.

  • China’s economic growth started to accelerate in the late 1970s due to a stream of reforms to promote marketization, privatization, openness to trade, and other objectives.

  • India opened up its economy through the LPG (Liberalization, Privatization and Globalization) reforms of 1991, after which growth picked up. India’s government expenditure also rose sharply after 1990-91; the adoption of the LPG reforms may be one of the reasons.

  • The increase in HH.Expenditure and Gov.Expenditure in India may be due to many factors, such as population growth or rising per-capita income.

  • China has a smoother, steadily rising Gross capital formation curve, which suggests that China has invested more in capital formation than India from the start of the period.

  • In December 2001, China joined the World Trade Organization, which accelerated its integration into the world economy. China’s comparative advantage in manufacturing due to its abundant and cheap labour has contributed to the rapid growth of net exports which remains a key driver of China’s growth today. The rapid economic growth has allowed all three components to increase simultaneously. The decentralization and the declining role of state-owned enterprises have contributed to the decreasing relative size of government expenditure.

  • In the data for the US, investment (Capital) and household consumption (HH.Expenditure) fall markedly during the financial crisis, while government spending and net exports move in the opposite direction. China’s net exports also fall markedly during the global financial crisis: this reflects not changes in the Chinese economy but the fall in expenditure in China’s major trading partners, including the US.

  • The United States, as a developed economy, grew at a relatively slow rate between 1970 and 2019. The net exports component has been decreasing as production of many goods has moved to low-cost developing countries. Most of the gains in income are devoted to household consumption, which has been rising at a disproportionately high rate compared with the other components.

  • Burundi shows a sharp increase in HH.Expenditure around 2004-05. Although Burundi’s household consumption has fluctuated substantially in recent years, it tended to increase over the 1970-2019 period, ending at 2,631 million US dollars in 2019.



Components as a Proportion of total GDP

Another way to visualize the GDP data is to look at each component as a proportion of total GDP, using the same countries: India, China, the United States and Burundi.

Calculating the proportion of total GDP

Using the comp dataset created earlier, first calculate net exports, as that contributes to GDP as well. The data is in long format, so it is reshaped to wide format to put the required variables in separate columns instead of separate rows; net exports are then calculated and the data is transformed back into long format.

Note: The net exports calculated here are not the same as the values calculated earlier, because the values in comp are on a different scale. The components are expressed as proportions of GDP in the next step.

# Reshape the data to wide format (indicators in columns)
comp_wide <- dcast(comp, Country + Year ~ IndicatorName)

head(comp_wide)
##   Country Year Capital Exports Gov.Expenditure HH.Expenditure Imports
## 1 Burundi 1970  0.9623  2.2702          2.0754        18.6360  2.4682
## 2 Burundi 1971  1.6831  1.8663          2.3417        19.3394  3.1353
## 3 Burundi 1972  0.6871  2.5341          2.7909        18.7999  3.2550
## 4 Burundi 1973  1.2937  2.6830          2.8210        20.8968  3.2472
## 5 Burundi 1974  2.4340  2.6519          3.4257        22.9891  4.2236
## 6 Burundi 1975  2.5130  2.7436          3.8278        29.8869  6.2996
# Add the new column for net exports = exports – imports

comp_wide$Net.Exports <- 
  comp_wide[, "Exports"] - comp_wide[, "Imports"]

head(comp_wide)
##   Country Year Capital Exports Gov.Expenditure HH.Expenditure Imports
## 1 Burundi 1970  0.9623  2.2702          2.0754        18.6360  2.4682
## 2 Burundi 1971  1.6831  1.8663          2.3417        19.3394  3.1353
## 3 Burundi 1972  0.6871  2.5341          2.7909        18.7999  3.2550
## 4 Burundi 1973  1.2937  2.6830          2.8210        20.8968  3.2472
## 5 Burundi 1974  2.4340  2.6519          3.4257        22.9891  4.2236
## 6 Burundi 1975  2.5130  2.7436          3.8278        29.8869  6.2996
##   Net.Exports
## 1     -0.1980
## 2     -1.2690
## 3     -0.7209
## 4     -0.5642
## 5     -1.5717
## 6     -3.5560

Now a new dataframe props is created from comp2 (the data transformed back into long format), adding each component’s proportion of GDP (proportion).

props = comp2 %>%
  group_by(Country, Year) %>%
  mutate(proportion = value / sum(value))
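
The step from comp_wide back to long format is not shown, so the sketch below walks through one plausible version of it in base R, using the first two Burundi rows from the output above. It assumes that comp2 holds Capital, Gov.Expenditure, HH.Expenditure and Net.Exports (with Exports and Imports dropped so that the components sum to GDP); the names ending in _toy are hypothetical.

```r
# First two Burundi rows from the head(comp_wide) output
comp_wide_toy <- data.frame(
  Country         = "Burundi",
  Year            = c(1970, 1971),
  Capital         = c(0.9623, 1.6831),
  Exports         = c(2.2702, 1.8663),
  Gov.Expenditure = c(2.0754, 2.3417),
  HH.Expenditure  = c(18.6360, 19.3394),
  Imports         = c(2.4682, 3.1353)
)
comp_wide_toy$Net.Exports <- comp_wide_toy$Exports - comp_wide_toy$Imports

# Assumption: only these four components enter the proportions
keep <- c("Capital", "Gov.Expenditure", "HH.Expenditure", "Net.Exports")

# Back to long format: one row per country, year and component
comp2_toy <- do.call(rbind, lapply(keep, function(v) {
  data.frame(Country  = comp_wide_toy$Country,
             Year     = comp_wide_toy$Year,
             variable = v,
             value    = comp_wide_toy[[v]])
}))

# Divide each value by the total for its country and year
comp2_toy$proportion <- comp2_toy$value /
  ave(comp2_toy$value, comp2_toy$Country, comp2_toy$Year, FUN = sum)

print(comp2_toy)
```

Under this assumption the proportions for each country and year sum to one, with a negative share for net exports whenever imports exceed exports.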

Plot line charts

india_prop <- ggplot(subset(props, Country == "India"),
       aes(x = Year, y = proportion)) + 
  geom_line(aes(group = variable, 
                color = variable), size = 2) + 
  scale_x_discrete(breaks=seq(1970, 2019, by = 5)) + 
  scale_y_continuous(name="Proportion of GDP") + 
  ggtitle("Share of GDP components over time for India (1970-2019)") + 
  theme_economist() + 
  scale_color_solarized() + 
  theme(text = element_text(size = 30))

china_prop <- ggplot(subset(props, Country == "China"),
       aes(x = Year, y = proportion)) + 
  geom_line(aes(group = variable, 
                color = variable), size = 2) + 
  scale_x_discrete(breaks=seq(1970, 2019, by = 5)) + 
  scale_y_continuous(name="Proportion of GDP") + 
  ggtitle("Share of GDP components over time for China (1970-2019)") + 
  theme_economist() + 
  scale_color_solarized() + 
  theme(text = element_text(size = 30))

us_prop <- ggplot(subset(props, Country == "United States"),
       aes(x = Year, y = proportion)) + 
  geom_line(aes(group = variable, 
                color = variable), size = 2) + 
  scale_x_discrete(breaks=seq(1970, 2019, by = 5)) + 
  scale_y_continuous(name="Proportion of GDP") + 
  ggtitle("Share of GDP components over time for USA (1970-2019)") + 
  theme_economist() + 
  scale_color_solarized() + 
  theme(text = element_text(size = 30))

burundi_prop <- ggplot(subset(props, Country == "Burundi"),
       aes(x = Year, y = proportion)) + 
  geom_line(aes(group = variable, 
                color = variable), size = 2) + 
  scale_x_discrete(breaks=seq(1970, 2019, by = 5)) + 
  scale_y_continuous(name="Proportion of GDP") + 
  ggtitle("Share of GDP components over time for Burundi (1970-2019)") + 
  theme_economist() + 
  scale_color_solarized() + 
  theme(text = element_text(size = 30))

grid.arrange(india_prop, china_prop, us_prop, burundi_prop, ncol=2, nrow =2)


Cross Sectional GDP Data

So far, the analysis has used the GDP data as a time series. Time series data is a collection of values for the same variables, taken at different points in time; the GDP data used so far is an example.

As well as being time series data, the GDP dataset can be used as a cross-sectional dataset for further analysis of material wellbeing.

Cross sectional data is a collection of values for the same variables for different subjects, usually taken at the same time.

Questions to be answered for the cross-sectional GDP data analysis

Choose three countries from each category: developed, developing and in economic transition.

Source of the list of countries that are developed, developing and in economic transition: UN Country Classification Document (https://www.un.org/development/desa/dpad/wp-content/uploads/sites/45/WESP2019_BOOK-ANNEX-en.pdf)

  1. The latest available data is for 2019. Calculate each component as a proportion of GDP for 2019.

  2. Try to find differences between the spending patterns of developed countries, countries in economic transition and developing countries.


The following countries have been chosen for each category. The countries within each category were chosen at random.

  • Developed countries: Germany, Japan, United States
  • Countries in economic transition: Albania, Russian Federation, Ukraine
  • Developing countries: Brazil, China, India

The dataframe table_gdp created earlier has been used here.

Before selecting the countries from the dataframe, first the proportions of GDP have been calculated for each country.

table_gdp$p_HH.Exp <- 
  table_gdp$HH.Expenditure / 
  (table_gdp$Capital +
    table_gdp$Final.Expenditure +
    table_gdp$Net.Exports)

table_gdp$p_FinalExp <-
  table_gdp$Final.Expenditure / 
  (table_gdp$Capital +
    table_gdp$Final.Expenditure +
    table_gdp$Net.Exports)

table_gdp$p_Capital <- 
  table_gdp$Capital / 
  (table_gdp$Capital +
    table_gdp$Final.Expenditure +
    table_gdp$Net.Exports)


table_gdp$p_NetExports <-
  table_gdp$Net.Exports /
  (table_gdp$Capital +
    table_gdp$Final.Expenditure + 
    table_gdp$Net.Exports)

countries <- c("Germany", "Japan", "United States","Albania", 
               "Russian Federation", "Ukraine","Brazil", "China", "India")

# Using table_gdp (one row per country and year), we select the
# proportion columns for our chosen countries in 2019.

# Select the columns we need
sel_2019 <- 
  subset(table_gdp, subset =
    (Country %in% countries) & (Year == 2019),
    select = c("Country", "Year", "p_HH.Exp", "p_FinalExp",
      "p_Capital", "p_NetExports"))

Plotting chart

# Reshape the table into long format, then use ggplot

sel_2019_new <- melt(sel_2019, id.vars = 
  c("Year", "Country"))

#Specifying the order of the countries as in the countries object

sel_2019_new$Country <- factor(sel_2019_new$Country, levels = countries)

ggplot(sel_2019_new, aes(x = Country, y = value, fill = variable)) +
  geom_bar(stat = "identity") + coord_flip() + 
  ggtitle("GDP component proportions in 2019") + 
  theme_wsj() + scale_fill_brewer(palette="Pastel1")

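Side Note: p_Capital, p_FinalExp and p_NetExports add up to 1 by construction, even when a country has a trade deficit (proportion of net exports < 0); in that case the proportions of final expenditure and capital together add up to more than 1. Because household expenditure is already part of final consumption expenditure, the stacked bars above, which also include p_HH.Exp, extend beyond 1.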
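
The side note can be checked with a small numeric sketch. All figures below are hypothetical and describe an imaginary country with a trade deficit; only the way the shares are formed follows the table_gdp calculations above.

```r
# Hypothetical country with a trade deficit (arbitrary units)
hh          <- 60    # household consumption
gov         <- 20    # government consumption
final_exp   <- hh + gov
capital     <- 30
net_exports <- -10

# Same denominator as in the table_gdp proportions
gdp <- capital + final_exp + net_exports

shares <- c(p_Capital    = capital,
            p_FinalExp   = final_exp,
            p_NetExports = net_exports) / gdp

total_share       <- sum(shares)
capital_plus_cons <- shares[["p_Capital"]] + shares[["p_FinalExp"]]

# Household consumption is already inside final consumption expenditure,
# so stacking its share on top of the other three counts it twice.
p_HH          <- hh / gdp
stacked_total <- total_share + p_HH

print(c(total_share = total_share,
        capital_plus_cons = capital_plus_cons,
        stacked_total = stacked_total))
```

In this example the three component shares sum to 1, capital and final expenditure together reach 1.1, and a stacked bar that also includes the household share would reach 1.6.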

Observations

  • Net exports make up the smallest proportion of GDP in developing countries, followed by developed countries. Japan and Brazil have the smallest net export proportions. Countries in economic transition have larger net export proportions, with Albania the highest.

  • Developing countries spend more on capital formation than developed countries, whereas developed countries devote a larger share of GDP to household expenditure.

  • Government final consumption expenditure is highest in developed countries, followed by countries in economic transition and then developing countries.

Summary

GDP per capita is the total value of an economy’s output (and hence income) over a period of time, divided by the population. By definition, it is a measure of material wellbeing. The metric rests on the notion that market prices for goods and services are good indicators of the welfare they provide to economic agents. GDP per capita is a simple concept to grasp and use, and it is a relatively objective metric. It is also the most extensively used metric, and vast sums of money have been spent to build the infrastructure needed to compute it.

However, even as a measure of material wellbeing, this metric has its own set of limitations:

  • Goods and services such as unpaid child care are not accounted for by GDP per capita since they are not sold in markets and so do not have prices. As a result, GDP per capita understates material wellbeing.

  • Polluting economic activities are included in GDP, but the harm they do to material well-being is not. As a result, GDP per capita exaggerates material well-being.

  • Income earned from production in a country may not remain in that country or be used by its citizens: foreign corporations that repatriate revenues from their affiliates take some of it abroad. Gross national income (GNI) per capita adjusts for this and can therefore be considered a better metric.

  • GDP per capita is an average across the population and hence does not represent the country’s inequalities in material well-being. A country’s GDP per capita might be high when income is concentrated in a small percentage of the population.




PART 2: HDI as a Measure of Non-Material Wellbeing

In PART 1 we looked at GDP per-capita as a measure of material wellbeing. While income has a major influence on wellbeing because it allows us to buy the goods and services we need or enjoy, it is not the only determinant of wellbeing. Many aspects of our wellbeing cannot be bought, for example, good health or having more time to spend with friends and family.

In PART 2 of this project, we use the Human Development Index (HDI) dataset, a measure of wellbeing that includes non-material aspects, and make comparisons with GDP per capita (a measure of material wellbeing).

GDP per capita is a simple index calculated as the sum of its elements, whereas the HDI is more complex. Instead of using different types of expenditure or output to measure wellbeing or living standards, the HDI consists of three dimensions associated with wellbeing:

  • a long and healthy life (health)
  • knowledge (education)
  • a decent standard of living (income)

The HDI data used here is taken from the United Nations Development Programme (UNDP) website. This data is for the year 2019.

Fun Fact! - Pakistani economist Mahbub ul Haq created the HDI in 1990, and it has since been used by the United Nations Development Programme (UNDP) to measure countries’ development. The index combines four indicators: life expectancy for health, expected years of schooling and mean years of schooling for education, and Gross National Income per capita for standard of living.


In theory, is each indicator used to calculate the HDI a good measure of its dimension? Could there be better indicators?

  • A country’s gross national income (GNI) is made up of its GDP plus income earned abroad by its residents, less income earned domestically by non-residents. Despite its shortcomings, GNI per capita is the best commonly available metric of living standards.

  • The expected and mean years of schooling are not ideal indicators of knowledge. Most notably, educational quality differs among countries.

  • Life expectancy, particularly in developing nations, can be influenced by quick fluctuations in infant and child mortality, and hence may not accurately reflect longevity. A longer life does not always imply a healthier or a happier life.

There may be better variables for calculating the HDI; more research could help identify them.


The table below gives the maximum and minimum values for each indicator in the HDI, as provided by UNDP in its technical notes (http://dev-hdr.pantheonsite.io/sites/default/files/hdr2016_technical_notes_0.pdf)

# Import dataset

hdi_values <- read_excel("HDI indicator values min and max.xlsx")

#Table

tab1 <- hdi_values %>%
  select(Dimension, Indicator, Max, Min) %>%
  gt() %>%
  tab_header(title = 
               "Maximum and Minimum Values for Each Indicator in the HDI")

tab1 <- tab1 %>% tab_options(table.background.color = "steelblue",
                             table.width = pct(100), data_row.padding = px(5),
                             table.font.color = "black") 

tab1
Maximum and Minimum Values for Each Indicator in the HDI

Dimension            Indicator                                    Max     Min
Health               Life expectancy at birth                     85      20
Education            Expected years of schooling (years)          18      0
Education            Mean years of schooling (years)              15      0
Standard of living   Gross National Income/capita (2011 PPP $)    75000   100

Calculation of HDI

Referring to the technical notes and the table constructed above, HDI has been calculated.

The HDI indicators are measured in different units and have different ranges, so in order to put them together into a meaningful index, normalization of the indicators is done by using the following formula:

Dimension index = (actual value − minimum value) / (maximum value − minimum value)

Doing so will give a value in the interval [0,1] which will allow comparison between different indicators.

Formulae used to calculate the indices:

  1. Health

DI Health = (actual value − minimum value) / (maximum value − minimum value)

  2. Education

Denote Expected years of schooling as EYS and Mean years of schooling as MYS.

DI EYS = (actual value − minimum value) / (maximum value − minimum value)

DI MYS = (actual value − minimum value) / (maximum value − minimum value)

DI Education = (DI EYS + DI MYS) / 2

  3. Standard of living

DI Income = [ln(actual value) − ln(minimum value)] / [ln(maximum value) − ln(minimum value)]

These three dimension indices are combined to calculate the HDI, which is the geometric mean of the three indices.

HDI = (DI Health x DI Education x DI Income)^(1/3)
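
As a worked example of these formulae, the sketch below computes the index step by step for a single country, using Norway's 2019 indicator values as they appear in the HDI dataset shown later in this report. The variable names are chosen here for illustration, and expected years of schooling is capped at its maximum of 18, as in the report's own R code.

```r
# Norway, 2019 (first data row of the HDI dataset)
life_exp    <- 82.4
exp_school  <- 18.06615
mean_school <- 12.89775
gni_capita  <- 66494.25217

# Health: life expectancy between the goalposts 20 and 85
di_health <- (life_exp - 20) / (85 - 20)

# Education: average of the two schooling indices
di_eys       <- (min(exp_school, 18) - 0) / (18 - 0)  # capped at 18 years
di_mys       <- (mean_school - 0) / (15 - 0)
di_education <- (di_eys + di_mys) / 2

# Standard of living: logs of income between the goalposts 100 and 75000
di_income <- (log(gni_capita) - log(100)) / (log(75000) - log(100))

# Geometric mean of the three dimension indices
hdi <- (di_health * di_education * di_income)^(1 / 3)

print(c(health = di_health, education = di_education,
        income = di_income, hdi = hdi))
```

The resulting value is close to the HDI of about 0.957 reported for Norway in the dataset.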


Calculating HDI in R

head(HDI2019)
## # A tibble: 6 x 9
##   ...1   ...2     ...3      SDG3     SDG4.3   SDG4.4   SDG8.5     ...8     ...9 
##   <chr>  <chr>    <chr>     <chr>    <chr>    <chr>    <chr>      <chr>    <chr>
## 1 <NA>   <NA>     <NA>      <NA>     <NA>     <NA>     <NA>       <NA>     <NA> 
## 2 <NA>   <NA>     Human De~ Life ex~ Expecte~ Mean ye~ Gross nat~ GNI per~ HDI ~
## 3 HDI r~ country  Value     (years)  (years)  (years)  (2017 PPP~ <NA>     <NA> 
## 4 <NA>   <NA>     2019      2019     2019     2019     2019       2019     2018 
## 5 <NA>   VERY HI~ <NA>      <NA>     <NA>     <NA>     <NA>       <NA>     <NA> 
## 6 1      Norway   0.956999~ 82.4     18.06615 12.89775 66494.252~ 7        1
str(HDI2019)
## tibble [268 x 9] (S3: tbl_df/tbl/data.frame)
##  $ ...1  : chr [1:268] NA NA "HDI rank" NA ...
##  $ ...2  : chr [1:268] NA NA "country" NA ...
##  $ ...3  : chr [1:268] NA "Human Development Index (HDI)" "Value" "2019" ...
##  $ SDG3  : chr [1:268] NA "Life expectancy at birth" "(years)" "2019" ...
##  $ SDG4.3: chr [1:268] NA "Expected years of schooling" "(years)" "2019" ...
##  $ SDG4.4: chr [1:268] NA "Mean years of schooling" "(years)" "2019" ...
##  $ SDG8.5: chr [1:268] NA "Gross national income (GNI) per capita" "(2017 PPP $)" "2019" ...
##  $ ...8  : chr [1:268] NA "GNI per capita rank minus HDI rank" NA "2019" ...
##  $ ...9  : chr [1:268] NA "HDI rank" NA "2018" ...

The imported HDI data has many missing (NA) entries, and several rows and columns hold header text or notes rather than data.

Cleaning the data

# Rename the first column, currently named ...1
names(HDI2019)[1] <- "HDI.rank"

# Rename the second column, currently named ...2
names(HDI2019)[2] <- "country"

# Rename the last column, which contains the 2018 rank
names(HDI2019)[9] <- "HDI.rank.2018"

# Eliminate rows without a rank (blank, header and group-label rows)
# and the row that contains the column title
HDI2019 <- subset(HDI2019,
  !is.na(HDI.rank) & HDI.rank != "HDI rank")

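Eliminating columns that contain notes in the original spreadsheet (names starting with ‘X_’). In this import the unnamed columns are labelled ‘...N’ rather than with an ‘X_’ prefix, so the check below matches nothing and all nine columns are kept, as the str output confirms.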

# Check which variables do NOT (!) start with X_
sel_columns <- !startsWith(names(HDI2019), "X_")

# Select the columns that do not start with X_
HDI2019 <- subset(HDI2019, select = sel_columns)

str(HDI2019)
## tibble [189 x 9] (S3: tbl_df/tbl/data.frame)
##  $ HDI.rank     : chr [1:189] "1" "2" "2" "4" ...
##  $ country      : chr [1:189] "Norway" "Ireland" "Switzerland" "Hong Kong S.A.R. of China" ...
##  $ ...3         : chr [1:189] "0.95699999999999996" "0.95499999999999996" "0.95499999999999996" "0.94899999999999995" ...
##  $ SDG3         : chr [1:189] "82.4" "82.31" "83.78" "84.86" ...
##  $ SDG4.3       : chr [1:189] "18.06615" "18.705290000000002" "16.328440000000001" "16.929469999999998" ...
##  $ SDG4.4       : chr [1:189] "12.89775" "12.666330500000001" "13.380812410000001" "12.279960000000001" ...
##  $ SDG8.5       : chr [1:189] "66494.252170000007" "68370.587369999994" "69393.520759999999" "62984.765529999997" ...
##  $ ...8         : chr [1:189] "7" "4" "3" "7" ...
##  $ HDI.rank.2018: chr [1:189] "1" "3" "2" "4" ...

Changing the names of some variables

names(HDI2019)[3] <- "HDI"
names(HDI2019)[4] <- "LifeExp"
names(HDI2019)[5] <- "ExpSchool"
names(HDI2019)[6] <- "MeanSchool"
names(HDI2019)[7] <- "GNI.capita"
names(HDI2019)[8] <- "GNI.HDI.rank"

Looking at the structure of the HDI2019 dataset, all the variables are stored as character variables. The variable country should be a factor and all other variables should be numeric.

HDI2019$HDI.rank <- as.numeric(HDI2019$HDI.rank)
HDI2019$country <- as.factor(HDI2019$country)
HDI2019$HDI <- as.numeric(HDI2019$HDI)
HDI2019$LifeExp <- as.numeric(HDI2019$LifeExp)
HDI2019$ExpSchool <- as.numeric(HDI2019$ExpSchool)
HDI2019$MeanSchool <- as.numeric(HDI2019$MeanSchool)
HDI2019$GNI.capita <- as.numeric(HDI2019$GNI.capita)
HDI2019$GNI.HDI.rank <- as.numeric(HDI2019$GNI.HDI.rank)
HDI2019$HDI.rank.2018 <- as.numeric(HDI2019$HDI.rank.2018)
str(HDI2019)
## tibble [189 x 9] (S3: tbl_df/tbl/data.frame)
##  $ HDI.rank     : num [1:189] 1 2 2 4 4 6 7 8 8 10 ...
##  $ country      : Factor w/ 189 levels "Afghanistan",..: 127 82 165 75 77 65 164 9 122 47 ...
##  $ HDI          : num [1:189] 0.957 0.955 0.955 0.949 0.949 0.947 0.945 0.944 0.944 0.94 ...
##  $ LifeExp      : num [1:189] 82.4 82.3 83.8 84.9 83 ...
##  $ ExpSchool    : num [1:189] 18.1 18.7 16.3 16.9 19.1 ...
##  $ MeanSchool   : num [1:189] 12.9 12.7 13.4 12.3 12.8 ...
##  $ GNI.capita   : num [1:189] 66494 68371 69394 62985 54682 ...
##  $ GNI.HDI.rank : num [1:189] 7 4 3 7 14 11 12 15 6 2 ...
##  $ HDI.rank.2018: num [1:189] 1 3 2 4 4 4 7 7 9 10 ...
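
The same conversion can be written more compactly by applying as.numeric to every column except country. The sketch below shows this on a two-row toy data frame; hdi_toy is a hypothetical stand-in for HDI2019 and contains only three of its columns.

```r
# Toy stand-in for HDI2019 with every column stored as character
hdi_toy <- data.frame(
  HDI.rank = c("1", "2"),
  country  = c("Norway", "Ireland"),
  HDI      = c("0.95699999999999996", "0.95499999999999996"),
  stringsAsFactors = FALSE
)

# Convert every column except country to numeric in one step
num_cols <- setdiff(names(hdi_toy), "country")
hdi_toy[num_cols] <- lapply(hdi_toy[num_cols], as.numeric)

# country becomes a factor
hdi_toy$country <- as.factor(hdi_toy$country)

str(hdi_toy)
```

This form avoids repeating one line per column, at the cost of being less explicit about which columns are converted.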

Now that the dataset is cleaned, further analysis can be continued.


Start by calculating the different indices.
HDI2019$I.Health <- 
  (HDI2019$LifeExp - 20) / (85 - 20)

HDI2019$I.Education <- 
  ((pmin(HDI2019$ExpSchool, 18) - 0) / 
  (18 - 0) + (HDI2019$MeanSchool - 0) / 
  (15 - 0)) / 2

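# GNI per capita above the 75,000 maximum is capped, in the same way
# as expected years of schooling is capped at 18
HDI2019$I.Income <-
  (log(pmin(HDI2019$GNI.capita, 75000)) - log(100)) /
  (log(75000) - log(100))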

HDI2019$HDI.calc <- 
  (HDI2019$I.Health * HDI2019$I.Education * 
    HDI2019$I.Income)^(1/3)

Now, the HDI given in the dataset can be compared with the one that has been calculated.

compare <- HDI2019[, c("country", "HDI", "HDI.calc")]

datatable(compare)

It can be observed that the calculated HDI values match the HDI values provided in the dataset, up to rounding.
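
The agreement can also be checked numerically rather than by eye. The sketch below recomputes the index for the first four countries, using the indicator values visible in the str(HDI2019) output, and reports the largest absolute difference from the published HDI. The helper calc_hdi and the data frame hdi_check are hypothetical names; the formula mirrors the calculation above.

```r
# First four countries, values taken from the str(HDI2019) output
hdi_check <- data.frame(
  country    = c("Norway", "Ireland", "Switzerland",
                 "Hong Kong S.A.R. of China"),
  HDI        = c(0.957, 0.955, 0.955, 0.949),
  LifeExp    = c(82.4, 82.31, 83.78, 84.86),
  ExpSchool  = c(18.06615, 18.70529, 16.32844, 16.92947),
  MeanSchool = c(12.89775, 12.6663305, 13.38081241, 12.27996),
  GNI.capita = c(66494.25217, 68370.58737, 69393.52076, 62984.76553)
)

# Hypothetical helper mirroring the index calculation
calc_hdi <- function(life_exp, exp_school, mean_school, gni) {
  i_health    <- (life_exp - 20) / (85 - 20)
  i_education <- (pmin(exp_school, 18) / 18 + mean_school / 15) / 2
  i_income    <- (log(gni) - log(100)) / (log(75000) - log(100))
  (i_health * i_education * i_income)^(1 / 3)
}

hdi_check$HDI.calc <- with(hdi_check,
  calc_hdi(LifeExp, ExpSchool, MeanSchool, GNI.capita))

# Largest gap between calculated and published values
max_diff <- max(abs(hdi_check$HDI.calc - hdi_check$HDI))

print(hdi_check[, c("country", "HDI", "HDI.calc")])
print(max_diff)
```

For these four countries the gap stays below 0.001, which is the precision at which the published index is reported.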


PART 3: Evaluating GDP Per Capita and HDI as measures of overall wellbeing

Ranking countries based on GDP per capita

Merged <- left_join(HDI2019, GDP_percapita, by = "country")

summary(Merged$`GDP per capita, PPP (constant 2017 international $)`)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    751.7   5040.1  13704.2  21138.8  30594.6 114323.4       33

GDP per capita is missing for 33 of the countries in the merged dataset, so these countries cannot be used in the analysis.

Now, we can rank countries based on GDP per capita.

# Creating a subset where all countries have HDI values and GDP per capita values 

Merged_sub <- 
  subset(Merged, !is.na(HDI) & !is.na(`GDP per capita, PPP (constant 2017 international $)`)) 
Now calculate the rank based on GDP per capita. Desired outcome: assign rank 1 to the country with the highest GDP per capita, 2 to the next one, and so on.
Merged_sub$GDP.pc.rank <-
  rank(-Merged_sub$`GDP per capita, PPP (constant 2017 international $)`,
       na.last = "keep")
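
The behaviour of rank() on negated values can be illustrated on a toy vector. The values and names below are made up for illustration; the point is how ties and missing values are handled under the same na.last = "keep" setting.

```r
# Hypothetical GDP per capita values (not from the dataset)
gdp_pc <- c(A = 12000, B = 54000, C = 3000, D = 54000, E = NA)

# Negating the values makes the largest GDP per capita receive rank 1.
# By default, tied values share the average of their ranks.
rank_avg <- rank(-gdp_pc, na.last = "keep")
rank_avg

# ties.method = "min" gives tied countries the same (lowest) rank instead
rank_min <- rank(-gdp_pc, na.last = "keep", ties.method = "min")
rank_min
```

In the first call B and D both receive rank 1.5, and in the second both receive rank 1. In both cases the missing value stays NA instead of being ranked.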

gdp_hdi <- Merged_sub[, c("country", "HDI.rank", "GDP.pc.rank")]

datatable(gdp_hdi)

Plotting a scatterplot to compare the rank of countries based on HDI with that of GDP per capita. (Interactive plot)

p <- Merged_sub %>%  
  mutate(text = paste("country: ", country, "\nHDI rank: ", HDI.rank,
                      "\nGdp per capita rank: ",GDP.pc.rank, sep="")) %>% 
  ggplot(aes(x = HDI.rank, y = GDP.pc.rank, text= text, color= country)) +
  geom_point(shape = 19, alpha=1) +
  labs(y = "GDP per capita rank", x = "HDI rank") +
  ggtitle("Comparing ranks between HDI and GDP per capita") +
 theme_solarized_2(light=FALSE) + 
  scale_size(range = c(2, 24)) +
    theme(legend.position="none")

pp <- ggplotly(p, tooltip = "text")

pp

Observations

  • The graph shows a strong correlation between the HDI rank and GDP per capita rank of different countries. This might be because the two indicators health and education are strongly correlated with GDP per capita.

  • There does not exist a perfect correlation between the two ranks, as outliers can be seen. For example, Equatorial Guinea has a GDP per-capita rank of 60 but an HDI rank of 145.

  • The plot does not show HDI and GDP per capita ranks for all countries as many countries had missing data.

  • GDP per-capita measures the total value of all output in the economy per head of the population. HDI is a geometric mean of three dimension indices: health, education, and the natural log of income per capita.

  • Unlike GDP per-capita, the HDI takes into account the possibility that income might have diminishing marginal utility (that is, doubling your income might less than double your wellbeing). GDP per-capita, however, assumes that wellbeing increases one-for-one with income, and that income is the only factor that determines wellbeing, which is not the case.
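
The diminishing-returns point in the last observation follows directly from the log in the income index. A minimal sketch, using the same goalposts (100 and 75000) as the I.Income formula earlier; the function name and the income levels are illustrative assumptions.

```r
# Income index with the same goalposts as the I.Income calculation
income_index <- function(gni) {
  (log(gni) - log(100)) / (log(75000) - log(100))
}

# Hypothetical income levels, each double the previous one
gni <- c(1000, 2000, 4000, 8000)

# Each doubling adds the same amount to the index, log(2) / log(750),
# however large the starting income is
index_gain <- diff(income_index(gni))
index_gain
```

So a rise from 1000 to 2000 counts for exactly as much as a rise from 4000 to 8000, whereas GDP per capita itself would treat the second change as four times larger.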

There are many ways of measuring wellbeing other than HDI and GDP per capita. The strengths and weaknesses of HDI as a measure of wellbeing are:

  • Strengths: The HDI is a more comprehensive measure of wellbeing than GDP per-capita since it takes into consideration education and health outcomes. The HDI is well-known and frequently utilized.

  • Weaknesses: As previously stated, schooling years and life expectancy may not be appropriate measures of knowledge and health. In HDI, all three dimensions are given equal weight. Countries may have differing perspectives on which factor is most essential.


Many other measures, such as unemployment, inequality and happiness, could be considered alongside the existing HDI components when measuring wellbeing.

There are several subjective metrics of non-material well-being available, such as gross national happiness and gross national well-being. These measurements are based on surveys that directly ask individuals about their non-material well-being.

The next Section of this project analyzes the Global Happiness Index (2019).


PART 4: The Happiness Index

About the Report

The World Happiness Report 2019 uses Gallup World Poll surveys from 2016 to 2018. The rankings are based on responses to the poll's main life evaluation question, the Cantril Ladder, which asks respondents to imagine a ladder with 10 representing the best possible life and 0 representing the worst possible life. They are then asked to rate their current life on that scale of 0 to 10. For the years 2016-2018, the rankings are based on nationally representative samples. They are solely based on survey results, with the Gallup weights used to make the estimates representative.

The report looks at life evaluations from 2016 to 2018 and gives yearly national rankings. These rankings are complemented by attempts to show how six key variables contribute to explaining the whole sample of national yearly average scores from 2005 to 2018. The variables are GDP per capita, social support, healthy life expectancy, freedom, generosity, and the absence of corruption. It is important to note that the report does not construct its own happiness measure for each nation from these six elements; the scores are based on individuals' own perceptions of their lives, as represented by the Cantril Ladder. The six factors are used only to explain how happiness differs between nations. The report also shows how measures of experienced well-being, particularly positive affect, can supplement life circumstances in explaining higher life evaluations.

About Dystopia plus residuals

Dystopia is a fictional country populated by the world’s unhappiest people. The goal of creating Dystopia is to create a standard against which all countries may be compared favourably (no nation performs worse than Dystopia) in terms of each of the six essential criteria, enabling each sub-bar to be of positive (or zero, in six cases) width. Dystopia is thus defined by the lowest scores found for the six key variables. Because life would be extremely miserable in a society with the world’s lowest wages, life expectancy, generosity, most corruption, least freedom, and least social support, it is referred to as “Dystopia,” as opposed to “Utopia.”

The residuals, or unexplained components, differ by nation and show the extent to which the six factors either over- or under-explain average 2016-2018 life evaluations. Over the whole set of countries, these residuals have an average value of about zero. The residuals have been combined with the estimate of life evaluations in Dystopia so that the combined bar always has a positive value. Although certain life evaluation residuals are quite large, occasionally exceeding one point on a scale of 0 to 10, they are always far smaller than the computed value in Dystopia, where the average life is assessed at 1.97 on a scale of 0 to 10.


NOTE: All the graphs in this section are interactive. Hover over them to see more detail.


Global Happiness Ladder - a Glance

This map gives an overview of the happiness ladder (ranks) of different countries. Ranks can be seen just by hovering over the map!

#Plot 

 highchart() %>%
  hc_add_series_map(
    worldgeojson, happy_2019, value = ("Ladder"), joinBy = c('name','country'),
    name = "Happiness Index Ladder")  %>% 
  hc_colorAxis(stops = color_stops(colors = 
                                     viridisLite::inferno(10, begin = 0.1))) %>% 
  hc_title(text = "Global Happiness Index 2019")
The data shown in the above map is also presented in tabular form below.
happy_ranks <- happy_2019 %>% group_by(country, Ladder, Ladder.score) %>% 
  select(country, Ladder, Ladder.score) %>%  mutate(Ladder.score = round(Ladder.score,3))

names(happy_ranks)[names(happy_ranks) == "country"] <- "Country"

datatable(happy_ranks)

Heat map to see the correlation of the variables with each other

heatmaply_cor(cor(happy_2019[, 4:10]),
  xlab = "Features", 
  ylab = "Features",
  colors = colorRampPalette(brewer.pal(3, "BuPu"))(256),
  k_col = 2, 
  k_row = 2) 

From the heat map, it can be seen that Ladder.score is highly correlated with Social support (0.777), Log GDP per capita (0.793) and Healthy life expectancy (0.779).


From the heatmap plotted above, it can be seen that Log GDP per capita and Ladder.score are highly correlated. It will be interesting to see how they vary across different regions.

Log GDP per-capita and Ladder.score
plot1 <- ggplot(happy_2019, 
  aes(x = Ladder.score, y= `Log GDP per capita`, 
      colour =Regional.indicator ,text = paste("country:", country))) +
  geom_point(show.legend = FALSE, alpha = 0.7) + scale_size(range = c(2, 12)) +
  scale_x_log10() + theme_igray()+ 
  scale_colour_brewer(type = "seq", palette = "Spectral") +
  labs(x = "Ladder score", y = "Log GDP per capita",
       caption = "Correlation between Log GDP per capita and Ladder Score Across Different Regions") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))

fig1 <- ggplotly(plot1)
fig1
  • From the above graph it can be observed that Ladder.score and Log GDP per capita have a roughly linear relationship. Most of the countries in the Western Europe region score high in terms of both Log GDP per capita and happiness scores. South Sudan had the lowest happiness score and a Log GDP per capita of 0.306.

  • Somalia, in the Sub-Saharan Africa region, scored the lowest in terms of Log GDP per capita (0.00), and the Central African Republic scored the second lowest (0.026).

  • The countries in Sub-Saharan Africa seem to have a larger spread of happiness scores and Log GDP scores. Mauritius is performing very well in both happiness score (5.88) and Log GDP per capita score (1.12). At the same time, there are also countries such as Benin, which has a relatively high happiness score (4.88) but a relatively low Log GDP score (0.47).


Social support and Ladder.score
plot2 <- ggplot(happy_2019, 
  aes(x = Ladder.score, y=`Social support`, 
      colour =Regional.indicator ,text = paste("country:", country))) +
  geom_point(show.legend = FALSE, alpha = 0.7) + scale_size(range = c(2, 12)) +
  scale_x_log10() + theme_igray()+ 
  scale_colour_brewer(type = "seq", palette = "Spectral") +
  labs(x = "Ladder score", y = "Social Support",
       caption = "Correlation between Social Support and Ladder Score Across Different Regions") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))

fig2 <- ggplotly(plot2)
fig2
  • Social support and Ladder.score also show a roughly linear relationship. As with GDP per capita, countries in the Western Europe region have higher social support.

  • The Central African Republic, in the Sub-Saharan Africa region, shows a social support score of 0. This is probably because of human rights violations and continuous unrest in the country due to social conflicts. Armed groups continue to commit serious human rights abuses, expanding their control to an estimated 70 percent of the country.


Healthy life expectancy and Ladder.score
plot3 <- ggplot(happy_2019, 
  aes(x = Ladder.score, y= `Healthy life expectancy` , 
      colour =Regional.indicator ,text = paste("country:", country))) +
  geom_point(show.legend = FALSE, alpha = 0.7) + scale_size(range = c(2, 12)) +
  scale_x_log10() + theme_igray()+ 
  scale_colour_brewer(type = "seq", palette = "Spectral") +
  labs(x = "Ladder score", y = "Healthy life expectancy",
       caption = "Correlation between Healthy life expectancy and Happiness Score Across Different Regions") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))


fig3 <- ggplotly(plot3)
fig3
  • Western Europe and North America and ANZ region countries take the lead in having a healthy life expectancy.

  • Similar to the above two graphs, Sub-Saharan Africa region countries have lower life expectancy scores. These countries have a wider spread of Healthy life expectancy scores vs Ladder.scores.

  • As of 2019, most countries in the Sub-Saharan Africa region have seen social and military unrest, and the data is consistent with this.


Top 10 and Bottom 10 Happiest Countries

In this section, we will look at the ten happiest and ten least happy countries by their Ladder scores. The arrange() function was applied to the new_dataset dataset to arrange the records in decreasing order of Ladder score, and the head() and tail() functions were applied to extract the top 10 and bottom 10 countries respectively.

When plotting the dot plot, we wanted to show the countries in decreasing order of Ladder score. Hence, reorder(country, -Ladder.score) was applied to sort the countries in decreasing order of the Ladder score.
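
The sorting and extraction step described above is not shown in this report. A minimal sketch of the idea follows, using base R order() as a stand-in for arrange() so that it runs without extra packages. The data frame, its values and the object names (scores, sorted_scores, top3, bottom3) are hypothetical and use three rows instead of ten to keep the example short.

```r
# Hypothetical scores (not from the happiness dataset)
scores <- data.frame(
  country      = paste("Country", LETTERS[1:8]),
  Ladder.score = c(5.2, 7.7, 3.1, 6.4, 4.8, 7.1, 2.9, 5.9)
)

# Sort in decreasing order of Ladder.score
# (the base R counterpart of arrange(desc(Ladder.score)))
sorted_scores <- scores[order(-scores$Ladder.score), ]

# First rows are the happiest countries, last rows the least happy
top3    <- head(sorted_scores, 3)
bottom3 <- tail(sorted_scores, 3)

top3
bottom3
```

The same pattern with head(…, 10) and tail(…, 10) on the sorted data yields objects that play the role of top10 and bottom10 in the plots below.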

top10_Ladder.score <- ggplot(top10, aes(x= reorder(country,-Ladder.score), 
                        y=Ladder.score, fill=Regional.indicator))+
 geom_point( color="#C4961A", size=4, shape=18) +
 geom_segment( aes(x=reorder(country,-Ladder.score), 
                   xend=reorder(country,-Ladder.score), 
                   y=0, yend=Ladder.score), color="grey") +
 theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank()) +
  scale_fill_brewer(palette = "Set3") +
  xlab("Country") + ylab("Ladder Score")+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 20))+
  ggtitle("Top 10 Countries with high Ladder Scores" )

ggplotly(top10_Ladder.score)

Observations

  • Most of the countries in the top 10 bracket according to their Ladder scores are from the Western Europe Region.
  • Finland takes the tag of the happiest country as per the ladder score.

bottom10<- tail(new_dataset,10) 

bottom_Ladder.score <- ggplot(bottom10, aes(x= reorder(country,-Ladder.score),
                           y=Ladder.score, fill=Regional.indicator))+
 geom_point( color="#00AFBB", size=4, shape=18) +
 geom_segment( aes(x=reorder(country,-Ladder.score),
                   xend=reorder(country,-Ladder.score), 
                   y=0, yend=Ladder.score), color="grey") +
 theme_minimal() +
 theme(
    plot.title = element_text(hjust = 0.5),
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank()
    ) +
  scale_fill_brewer(palette = "Set3")+
  xlab("Country") +
  ylab("Ladder Score")+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 20)) + 
  ggtitle("Bottom 10 Countries with low Ladder Scores" )
  
ggplotly(bottom_Ladder.score)

Observations

  • Most of the countries that have low ladder scores are from the Sub-Saharan Africa region.

What is the Ladder score across the different regions?

It will be interesting to see Ladder scores across different regions. For this, interactive box plots have been plotted.

 box1 <- ggplot(happy_2019, aes(x=Regional.indicator, y = Ladder.score, 
                                fill = Regional.indicator)) +
  geom_boxplot() +  xlab("Regional Indicator") + ylab("Ladder Score")+
  theme_minimal()+
  geom_violin()+
  scale_fill_brewer(palette="Set3") + 
  stat_summary(geom = 'point', fun = 'mean', color='red')+ 
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))

ggplotly(box1)

Key Takeaways

  • Countries in the North America and ANZ region and the Western Europe region have the highest mean Ladder scores. However, the Western Europe region has a wider spread of Ladder scores than the North America and ANZ region, with a maximum ladder score of 7.77 and a minimum ladder score of 5.29.

  • Countries in the Sub-Saharan Africa region have the lowest mean ladder score.

  • Other regions with a wide spread of ladder scores are the Latin America and Caribbean region, with a maximum ladder score of 7.17 and a minimum ladder score of 3.60, and the Middle East and North Africa region, with maximum and minimum ladder scores of 7.14 and 3.38 respectively.

  • Even though the Sub-Saharan Africa region has the lowest mean ladder score, it shows a fairly wide spread of ladder scores, with maximum and minimum ladder scores of 5.89 and 2.85 respectively.



What is the average life expectancy across the different regions?

The following boxplot is plotted to observe the average life expectancy across different regions.

box2 <- ggplot(happy_2019, aes(x=Regional.indicator, y = `Healthy life expectancy`))+
   geom_boxplot()+
   geom_violin(aes(fill=Regional.indicator))+
   theme_minimal()+
   stat_summary(geom = 'point', fun = 'mean', color='red')+
   scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
   theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black")) + 
  scale_fill_brewer(palette="Set3")

ggplotly(box2)

Key Takeaways

  • The average life expectancy score is the highest in Western Europe region (1.015) followed by the North America and ANZ region (0.993) and the East Asia region (0.953).

  • It can also be observed that the Sub-Saharan Africa region has the lowest average life expectancy score (4.075). This might be because most countries in this region experience political and civic unrest and have limited access to healthcare.


Some insights drawn from the exploratory data analysis of the happiness index data:

  • According to this data, the world’s happiest nations are predominantly located in Western Europe (particularly Northern Europe), North America, and Australia and New Zealand. It also indicates that the most significant element in determining a country’s happiness is its economy (GDP per capita). The happiest countries and world regions were those with robust and stable economies. The value of Economy is also closely associated with the value of Family and Health.

  • Greater economic stability and higher GDP per capita often foster stable and comfortable family life while also increasing the availability of appropriate medical resources and healthcare. These factors are thus given more weight when calculating overall happiness.

  • The three factors–Economy, Family, and Health–tend to be especially essential since they have a direct impact on the people who live in these nations. The status of the economy affects everyone, especially because it has direct influence over the availability and security of jobs, as well as the flow of money.

  • Families are the foundation of most people’s home lives, and health impacts people on an individual level as well. As a result, because they are extremely tangible characteristics, they have a greater effect on the happiness score as measured by individuals.

  • Sub-Saharan Africa and Southern Asia could use a boost, but the rest of the globe appears to be doing well. Let us look forward to the future!




PART 5: Linear Regression Analysis

For the regression analysis, it will be interesting to see how GDP/GNI and HDI affect the Ladder score. The happiness index and HDI datasets have been joined for the regression analysis.

The dependent variable is Ladder.score. Since the ladder score has already been explained by the variables in the happiness dataset, this regression analysis will consider the variables in HDI dataset as independent variables. The iterative process has been shown below.

First Model

m1 <- Ladder.score ~ HDI + LifeExp + ExpSchool + MeanSchool + GNI.capita + I.Health + I.Education + I.Income 

reg1 <- lm(m1 , data = hdi_happiness)

reg1
## 
## Call:
## lm(formula = m1, data = hdi_happiness)
## 
## Coefficients:
## (Intercept)          HDI      LifeExp    ExpSchool   MeanSchool   GNI.capita  
##   3.309e+00    2.101e+01   -7.039e-02    3.119e-01    3.244e-01    1.454e-05  
##    I.Health  I.Education     I.Income  
##          NA   -1.757e+01   -5.199e+00
summary(reg1)
## 
## Call:
## lm(formula = m1, data = hdi_happiness)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.83315 -0.36636  0.03775  0.45854  1.41846 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  3.309e+00  2.389e+00   1.385   0.1682  
## HDI          2.101e+01  2.157e+01   0.974   0.3317  
## LifeExp     -7.039e-02  9.695e-02  -0.726   0.4690  
## ExpSchool    3.119e-01  1.357e-01   2.298   0.0230 *
## MeanSchool   3.244e-01  1.868e-01   1.737   0.0847 .
## GNI.capita   1.454e-05  6.736e-06   2.158   0.0327 *
## I.Health            NA         NA      NA       NA  
## I.Education -1.757e+01  1.009e+01  -1.742   0.0837 .
## I.Income    -5.199e+00  7.785e+00  -0.668   0.5053  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6283 on 139 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.7062, Adjusted R-squared:  0.6914 
## F-statistic: 47.73 on 7 and 139 DF,  p-value: < 2.2e-16
plot(reg1)

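  • It can be observed that the coefficient for I.Health is reported as NA. It is not estimated because I.Health is an exact linear function of LifeExp (a singularity), not because the data are missing. Also, the education index and income index have negative coefficients in this model.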

  • For the next model, the HDI Indices can be eliminated from the model because there is a high chance of multicollinearity. The indices have been calculated using the other variables in the HDI dataset.
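
The NA coefficient can be reproduced on simulated data. The sketch below is illustrative only: the response y and the simulated values are made up, and only the relationship between LifeExp and I.Health mirrors the formula used earlier in the report.

```r
set.seed(1)
n <- 50

# Simulated life expectancy and the health index derived from it
LifeExp  <- runif(n, 55, 85)
I.Health <- (LifeExp - 20) / (85 - 20)  # exact linear function of LifeExp

# Hypothetical response that depends on life expectancy plus noise
y <- 2 + 0.05 * LifeExp + rnorm(n, sd = 0.3)

# lm() cannot separate the two perfectly collinear predictors,
# so it drops the later one and reports its coefficient as NA
fit <- lm(y ~ LifeExp + I.Health)
coef(fit)
```

Because the two predictors carry exactly the same information, removing either one leaves the fitted values unchanged, which is why dropping the indices in the next model loses nothing on the health side.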

pred <- data.frame(prediction = predict(reg1, hdi_happiness), Ladder.score=hdi_happiness$Ladder.score) %>% mutate(error= prediction-Ladder.score)

pred2 <- pred %>% mutate(errorsq1=error^2)

pred2 <- na.omit(pred2)

sum(pred2$errorsq1)
## [1] 54.8695
one <- ggplot(pred2) + theme_solarized_2() +  geom_point(aes(x=prediction,y=Ladder.score),color="yellowgreen") +
  geom_abline(color = "orangered1") + theme(axis.line=element_line(),
                                      panel.grid.major=element_blank(),
                                      panel.grid.minor = element_blank(),
                                      panel.border = element_blank(),
                                      panel.background = element_blank())

ggplotly(one)

This model is not a good predictive model. This can clearly be seen from the above interactive plot. It is very scattered and has many outliers.



Second Model

m2 <- Ladder.score ~ HDI + LifeExp + ExpSchool + MeanSchool + GNI.capita

reg2 <- lm(m2 , data = hdi_happiness)

reg2
## 
## Call:
## lm(formula = m2, data = hdi_happiness)
## 
## Coefficients:
## (Intercept)          HDI      LifeExp    ExpSchool   MeanSchool   GNI.capita  
##   1.490e+00    4.392e+00    3.842e-03    4.489e-02   -5.650e-02    1.487e-05
summary(reg2)
## 
## Call:
## lm(formula = m2, data = hdi_happiness)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77820 -0.35322 -0.00368  0.43110  1.51395 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  1.490e+00  8.370e-01   1.780  0.07715 . 
## HDI          4.392e+00  2.628e+00   1.671  0.09690 . 
## LifeExp      3.842e-03  2.298e-02   0.167  0.86742   
## ExpSchool    4.489e-02  4.689e-02   0.957  0.34008   
## MeanSchool  -5.650e-02  5.737e-02  -0.985  0.32632   
## GNI.capita   1.487e-05  4.881e-06   3.046  0.00277 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6346 on 141 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.696,  Adjusted R-squared:  0.6852 
## F-statistic: 64.56 on 5 and 141 DF,  p-value: < 2.2e-16
  • The adjusted R-squared value for this regression model is 68.52%. It can be improved.

  • GNI.capita has a positive impact on Ladder.score and is statistically significant.
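
The adjusted R-squared quoted above can be recovered from the other figures in the summary. A minimal sketch, assuming the standard adjustment formula; the helper name adj_r2 is illustrative. The sample size n = 147 follows from the 141 residual degrees of freedom plus 5 predictors and the intercept.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Second model: Multiple R-squared 0.696, 5 predictors, 147 observations used
adj_m2 <- adj_r2(0.696, n = 147, p = 5)
round(adj_m2, 3)

# First model: Multiple R-squared 0.7062, 7 estimated predictors, same n
adj_m1 <- adj_r2(0.7062, n = 147, p = 7)
round(adj_m1, 3)
```

Both values agree with the summaries to within rounding (about 0.685 and 0.691). The penalty for extra predictors is the reason the first model's advantage in raw R-squared shrinks after adjustment.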



Third Model with modified dataset

mod_data <- read_excel("new hdi_happy.xlsx")

m3 <- Ladder.score ~ dystopia_residual + HDI + LifeExp + ExpSchool + log2_mean + ln.gni.capita

reg3 <- lm(m3 , data = mod_data)

reg3
## 
## Call:
## lm(formula = m3, data = mod_data)
## 
## Coefficients:
##       (Intercept)  dystopia_residual                HDI            LifeExp  
##          -7.65568            1.03666          -11.07686            0.08374  
##         ExpSchool          log2_mean      ln.gni.capita  
##           0.12711            0.29735            1.06786
summary(reg3)
## 
## Call:
## lm(formula = m3, data = mod_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07203 -0.20674  0.03659  0.24329  0.95709 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -7.65568    2.06639  -3.705 0.000315 ***
## dystopia_residual   1.03666    0.06128  16.917  < 2e-16 ***
## HDI               -11.07686    6.19648  -1.788 0.076244 .  
## LifeExp             0.08374    0.02940   2.848 0.005134 ** 
## ExpSchool           0.12711    0.06205   2.049 0.042574 *  
## log2_mean           0.29735    0.15635   1.902 0.059479 .  
## ln.gni.capita       1.06786    0.32686   3.267 0.001400 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.355 on 126 degrees of freedom
## Multiple R-squared:  0.9072, Adjusted R-squared:  0.9028 
## F-statistic: 205.4 on 6 and 126 DF,  p-value: < 2.2e-16
  • Adding the variable dystopia_residual increases the adjusted R-squared value.

  • Instead of mean years of schooling, the log of that variable (log2_mean) has been used here, and it gives better results.

  • Similarly, the log of GNI per capita has been used here, and it is statistically significant as well.

  • HDI has a negative coefficient for Ladder.score. In the next iteration, it will be interesting to see what happens if HDI is removed.



Fourth model: without HDI

m4 <- Ladder.score ~ dystopia_residual + LifeExp + ExpSchool + log2_mean + ln.gni.capita

reg4 <- lm(m4 , data = mod_data)

reg4
## 
## Call:
## lm(formula = m4, data = mod_data)
## 
## Coefficients:
##       (Intercept)  dystopia_residual            LifeExp          ExpSchool  
##          -4.03975            1.02591            0.03364            0.02279  
##         log2_mean      ln.gni.capita  
##           0.02613            0.49604
summary(reg4)
## 
## Call:
## lm(formula = m4, data = mod_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03264 -0.18837  0.03722  0.27305  0.96367 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -4.039754   0.425926  -9.485  < 2e-16 ***
## dystopia_residual  1.025909   0.061509  16.679  < 2e-16 ***
## LifeExp            0.033637   0.008954   3.756 0.000261 ***
## ExpSchool          0.022793   0.021260   1.072 0.285708    
## log2_mean          0.026132   0.038089   0.686 0.493923    
## ln.gni.capita      0.496043   0.067743   7.322 2.47e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3581 on 127 degrees of freedom
## Multiple R-squared:  0.9049, Adjusted R-squared:  0.9011 
## F-statistic: 241.7 on 5 and 127 DF,  p-value: < 2.2e-16
  • Keeping log2_mean in the model does not help: the adjusted R-squared value has decreased slightly (from 0.9028 to 0.9011) and the variable is not statistically significant.

  • In the next iteration, log2_mean has been removed.



Fifth model

m5 <- Ladder.score ~ dystopia_residual + LifeExp + ExpSchool +ln.gni.capita

reg5 <- lm(m5, data = mod_data)

reg5
## 
## Call:
## lm(formula = m5, data = mod_data)
## 
## Coefficients:
##       (Intercept)  dystopia_residual            LifeExp          ExpSchool  
##          -4.18284            1.02016            0.03371            0.02772  
##     ln.gni.capita  
##           0.51737
summary(reg5)
## 
## Call:
## lm(formula = m5, data = mod_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0552 -0.1923  0.0495  0.2616  1.0233 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -4.182840   0.370600 -11.287  < 2e-16 ***
## dystopia_residual  1.020156   0.060808  16.777  < 2e-16 ***
## LifeExp            0.033710   0.008935   3.773 0.000246 ***
## ExpSchool          0.027722   0.019967   1.388 0.167445    
## ln.gni.capita      0.517368   0.060066   8.613 2.25e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3574 on 128 degrees of freedom
## Multiple R-squared:  0.9045, Adjusted R-squared:  0.9016 
## F-statistic: 303.2 on 4 and 128 DF,  p-value: < 2.2e-16
vif(reg5)
## dystopia_residual           LifeExp         ExpSchool     ln.gni.capita 
##          1.010995          4.889248          3.781968          5.450744
  • Modifying the model again did not make much of a difference. All variables except ExpSchool are statistically significant.

  • The adjusted R-squared value (0.9016) is similar to that of the previous model (0.9011).

  • For the final iteration, ln.gni.capita is omitted and Log GDP per capita is used instead. HDI is also included again in the next model.
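
The vif() values printed for the fifth model come from the car package. What a variance inflation factor measures can be sketched by hand in base R: each predictor is regressed on the remaining predictors and VIF = 1 / (1 - R²). The data and variable names below are simulated for illustration and are not the report's variables.

```r
set.seed(42)
n <- 200

# Simulated predictors: x2 is correlated with x1, x3 is independent
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on the others, then 1 / (1 - R^2)
manual_vif <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
manual_vif
```

Values near 1 indicate a predictor that is almost unrelated to the others, while larger values indicate overlap. On that reading, the values of roughly 4 to 5.5 for LifeExp, ExpSchool and ln.gni.capita above point to noticeable correlation among those three variables, while dystopia_residual (about 1.01) is nearly independent of them.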



Sixth model

m6 <- Ladder.score ~ LifeExp + `Log GDP per capita` + HDI + ExpSchool + MeanSchool + Dystopia_residual

reg6 <- lm(m6 , data = hdi_happiness2)

reg6
## 
## Call:
## lm(formula = m6, data = hdi_happiness2)
## 
## Coefficients:
##          (Intercept)               LifeExp  `Log GDP per capita`  
##             -1.10585               0.07076               2.32734  
##                  HDI             ExpSchool            MeanSchool  
##             -6.75334               0.08773               0.11797  
##    Dystopia_residual  
##              1.03994
summary(reg6)
## 
## Call:
## lm(formula = m6, data = hdi_happiness2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91735 -0.22522  0.00868  0.24285  0.85277 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.10585    0.46468  -2.380 0.018820 *  
## LifeExp               0.07076    0.01867   3.790 0.000233 ***
## `Log GDP per capita`  2.32734    0.45942   5.066 1.41e-06 ***
## HDI                  -6.75334    3.22642  -2.093 0.038343 *  
## ExpSchool             0.08773    0.03754   2.337 0.021020 *  
## MeanSchool            0.11797    0.04944   2.386 0.018523 *  
## Dystopia_residual     1.03994    0.05801  17.928  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3393 on 126 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.9153, Adjusted R-squared:  0.9113 
## F-statistic:   227 on 6 and 126 DF,  p-value: < 2.2e-16
plot(reg6)

  • All the independent variables are statistically significant.

  • Ladder.score is best explained by the variables LifeExp , Log GDP per capita and Dystopia_residual.

  • The adjusted R-squared is 91.13%. This shows that this is a good model.

plot <- data.frame(prediction2 = predict(reg6, hdi_happiness2), Ladder.score=hdi_happiness2$Ladder.score) %>% mutate(error= prediction2-Ladder.score)

plot2 <- plot %>% mutate(error_sq=error^2)

plot2 <- na.omit(plot2)

sum(plot2$error_sq)
## [1] 14.50183
two <- ggplot(plot2) + geom_point(aes(x=prediction2,y=Ladder.score),color="yellowgreen") +
  geom_abline(color = "orangered1") + theme_solarized_2() +
  theme(axis.line=element_line(), panel.grid.major=element_blank(),
                                      panel.grid.minor = element_blank(),
                                      panel.border = element_blank(),
                                      panel.background = element_blank())

ggplotly(two)

The interactive scatter plot shows a much tighter fit than that of the first model, and the sum of squared errors is far lower (14.50 compared with 54.87). The final model proves to be a good predictive model for Ladder.score, as seen from this graph.




Summing up and Comments

From the exploratory data analysis of the Happiness Index data for 2019, it can be observed that economic growth, or GDP per capita, is the major contributor towards happiness. In PART 1 we concluded that material wellbeing cannot be the sole contributor towards an individual’s wellbeing. However, after analysing both the HDI and the Happiness Index, it can be seen that GDP does play a huge role. It is often said that money can’t buy happiness. Ironically, the datasets analysed here suggest otherwise. GDP per capita plays a major role in determining global development and happiness.

The datasets used for this project are for the year 2019. Unfortunately, the GDP and HDI datasets were not available for the years 2020 and 2021, the years of the Covid-19 pandemic. The pandemic has affected all countries in terms of economy, development, happiness, etc. If the analysis were to be done on the latest 2020/2021 datasets, the results might be different, and it would be interesting to see whether happiness and development would still depend on the GDP of a country.

Finally, the regression models formed in this project indicate that GDP does play a major role in determining the wellbeing of people, be it material or non-material. But again, as George E.P. Box said, “All models are wrong, but some are useful.” The different variables and formulae used for measuring wellbeing are not perfect, but they do prove to be useful.

A new measure of economic wellbeing, the Economic Complexity Index (ECI), has been introduced by the Growth Lab of Harvard University. As a further extension to this project, it would be interesting to analyze how the ECI is calculated and how much better a measure it is of the economy and the wellbeing of people around the globe.